This is a brief outline of the project: click a link below to jump to the corresponding section, and use the go-back button to return to this outline.
import numpy as np
import pandas as pd
import imageio
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
import nltk.corpus
from nltk.corpus import wordnet
import requests
import lxml.html as lx
The financial market channels funds from people with surplus money to those with a shortage of money. As figure 1 shows, there are two ways to make a deal in the financial system. One is direct finance, in which lenders buy financial instruments directly from borrowers. The other is indirect finance, which goes through financial intermediaries such as banks. The advantage of indirect finance is that intermediaries lower transaction costs and mitigate asymmetric information. The main function of a financial intermediary is therefore to use the available information to select less risky borrowers before the transaction, and to ensure the borrower repays the loan after it.
LendingClub is a bank holding company that provides loan services. Although loans carry higher risk and lower liquidity, about 60% of a bank's funds are loaned out, because higher risk comes with higher return, making lending the bank's major product. Our main goal is to help the company analyze the existing information about borrowers to avoid adverse selection, that is, to avoid selecting risky borrowers. The key question is: what are the main factors that affect the risk level of a borrower?
Figure 1: the financial market system
The dataset contains 2,260,701 rows, which is relatively large. Since we want to predict whether a borrower is able to repay the loan, we use the loan status as the dependent (target) variable. We focus on only two loan statuses, fully paid and charged off, which represent being able and unable to repay respectively; this leaves 1,345,310 usable rows. We then select independent variables that might be significant. From an understanding of how loans work, we screen out 15 potential numeric variables, including the loan amount, the funded amount, the interest rate, the monthly payment, the annual income, and the debt-to-income ratio, as well as 9 potentially significant categorical variables, including the term, the grade, the homeownership, the purpose, and the location.
# select potential variables
LC_loan = pd.read_csv(r'C:\Users\daiyh\OneDrive\Documents\loan_project\LC_loan.csv',
usecols=["loan_amnt","funded_amnt","funded_amnt_inv","term",
"int_rate","installment","grade","emp_length",
"home_ownership" ,"annual_inc","verification_status",
"purpose","addr_state","dti","earliest_cr_line",
"inq_last_6mths","open_acc","pub_rec","revol_bal",
"revol_util","total_acc","total_rec_prncp",
"total_rec_int","loan_status"])
LC_loan.head()
| | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | emp_length | home_ownership | annual_inc | ... | dti | earliest_cr_line | inq_last_6mths | open_acc | pub_rec | revol_bal | revol_util | total_acc | total_rec_prncp | total_rec_int |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3600.0 | 3600.0 | 3600.0 | 36 months | 13.99 | 123.03 | C | 10+ years | MORTGAGE | 55000.0 | ... | 5.91 | Aug-2003 | 1.0 | 7.0 | 0.0 | 2765.0 | 29.7 | 13.0 | 3600.00 | 821.72 |
| 1 | 24700.0 | 24700.0 | 24700.0 | 36 months | 11.99 | 820.28 | C | 10+ years | MORTGAGE | 65000.0 | ... | 16.06 | Dec-1999 | 4.0 | 22.0 | 0.0 | 21470.0 | 19.2 | 38.0 | 24700.00 | 979.66 |
| 2 | 20000.0 | 20000.0 | 20000.0 | 60 months | 10.78 | 432.66 | B | 10+ years | MORTGAGE | 63000.0 | ... | 10.78 | Aug-2000 | 0.0 | 6.0 | 0.0 | 7869.0 | 56.2 | 18.0 | 20000.00 | 2705.92 |
| 3 | 35000.0 | 35000.0 | 35000.0 | 60 months | 14.85 | 829.90 | C | 10+ years | MORTGAGE | 110000.0 | ... | 17.06 | Sep-2008 | 0.0 | 13.0 | 0.0 | 7802.0 | 11.6 | 17.0 | 19102.35 | 12361.66 |
| 4 | 10400.0 | 10400.0 | 10400.0 | 60 months | 22.45 | 289.91 | F | 3 years | MORTGAGE | 104433.0 | ... | 25.37 | Jun-1998 | 3.0 | 12.0 | 0.0 | 21929.0 | 64.5 | 35.0 | 10400.00 | 1340.50 |
5 rows × 24 columns
# only keep fully paid and charged off for the dependent variable loan_status
dum_status = []
for status in LC_loan['loan_status']:
    if status == 'Fully Paid':
        dum_status.append("Fully Paid")
    elif status == 'Charged Off':
        dum_status.append("Charged Off")
    else:
        dum_status.append(2)  # sentinel marking every other status
LC_loan['loan_status'] = dum_status
LC_loan = LC_loan[LC_loan['loan_status'] != 2]
LC_loan = LC_loan.reset_index(drop=True)  # reset_index returns a copy, so it must be reassigned
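The same filtering can also be done without an explicit loop. A minimal sketch using pandas `isin`, on a toy frame standing in for `LC_loan` (the column name matches the original; the values are made up):

```python
import pandas as pd

# Toy frame standing in for LC_loan (hypothetical values)
toy = pd.DataFrame({"loan_status": ["Fully Paid", "Current", "Charged Off", "Late"]})

# Vectorized filtering: keep only the two statuses of interest
kept = toy[toy["loan_status"].isin(["Fully Paid", "Charged Off"])].reset_index(drop=True)
```

Because `isin` returns a boolean mask, no sentinel value or intermediate list is needed.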
loan_status = pd.DataFrame(LC_loan.loc[:,'loan_status'].value_counts()).reset_index()
loan_status.columns = ['loan status', 'number of counts']
loan_status
| | loan status | number of counts |
|---|---|---|
| 0 | Fully Paid | 1076751 |
| 1 | Charged Off | 268559 |
A seaborn heatmap of the correlations between the numeric features (figure 2) shows several groups of highly correlated variables. The loan amount (loan_amnt) is strongly correlated with the monthly payment (installment), which is reasonable since larger loans carry larger installments. The principal received to date (total_rec_prncp) is likewise highly correlated with the loan amount, since for fully paid loans the two are essentially equal. The number of open credit lines (open_acc) and the total number of credit lines in the borrower's credit file (total_acc) are also highly correlated.
LC_loan_num = LC_loan[["loan_amnt","int_rate","installment","annual_inc","dti",
"inq_last_6mths","open_acc","pub_rec","revol_bal",
"revol_util","total_acc","total_rec_prncp","total_rec_int"]]
corrMatrix = LC_loan_num.corr()
sns.heatmap(corrMatrix, cmap="YlGnBu")
plt.show()
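The highly correlated pairs behind the heatmap can also be listed explicitly by scanning the upper triangle of the correlation matrix. A sketch on synthetic data (the 0.8 threshold and the variable values are illustrative assumptions, not from the original dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: installment is constructed to track loan_amnt closely
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({"loan_amnt": a,
                   "installment": a * 30 + rng.normal(scale=0.1, size=200),
                   "dti": rng.normal(size=200)})

corr = df.corr()
# Upper triangle only, so each pair appears once
pairs = [(r, c, corr.loc[r, c])
         for i, r in enumerate(corr.index)
         for c in corr.columns[i + 1:]
         if abs(corr.loc[r, c]) > 0.8]
```

This turns the visual inspection of the heatmap into a concrete list of candidate columns to drop.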
Figure 2: correlation between numeric features
A matplotlib barplot of the different loan purposes (figure 3) shows that the top two purposes are debt consolidation and credit cards. Debt consolidation leads by a wide margin, with credit card a distant second at well under half its count. Interestingly, auto loans make up only a small share of the applications, roughly the same as vacation and moving combined.
purpose = LC_loan['purpose'].value_counts().keys().tolist()
purpose_count = LC_loan['purpose'].value_counts().tolist()
fig = plt.figure(figsize=(10, 5))
plt.barh(purpose[:10], purpose_count[:10], color=(0.2, 0.4, 0.6, 0.6))
plt.xlabel("Number of Loans")  # counts run along the x-axis for barh
plt.ylabel("Purpose of Loan")
plt.title("Top 10 Purposes of Loan")
plt.show()
Figure 3: barplot of purpose
The scatter plot (figure 4) shows that the interest rate increases as the grade moves from A to G; in other words, better grades carry lower interest rates, which may influence a borrower's ability to pay back the loan. The violin chart (figure 5), meanwhile, shows that borrowers with higher grades tend to apply for larger loans.
Loan_1000 = LC_loan.loc[0:1000,["loan_amnt","int_rate","grade"]]
sns.lmplot(x='int_rate', y='loan_amnt', data = Loan_1000 , hue='grade', fit_reg=False, palette="Set3")
Figure 4: scatterplot of loan amount and interest rate by grade
sns.violinplot(x = "grade", y = "loan_amnt", data = LC_loan, palette="Set3")
Figure 5: violin chart of loan amount by grade
We then construct a pie chart of the three major homeownership types (figure 6) and a barplot of loan status by homeownership (figure 7) using matplotlib. The pie chart shows that around half of the borrowers hold a mortgage; among the rest, roughly three times as many rent their home as own it. The barplot shows the proportion of each loan status within the different homeownership types; the borrowers who bought their houses with a mortgage appear to have the highest proportion of fully paid loans.
home_ownership = LC_loan['home_ownership'].value_counts().keys().tolist()[0:3]
home_ownership_count = LC_loan['home_ownership'].value_counts().tolist()[0:3]
home_ownership_label = [home_ownership[i] + ": " + str(home_ownership_count[i]) for i in range(3)]
plt.figure(figsize=(5, 5))
c_colors = ['#D4E6F1', '#7FB3D5', '#A9CCE3']
plt.pie(home_ownership_count, labels=home_ownership_label, colors=c_colors)
central_circle = plt.Circle((0, 0), 0.5, color='white')  # turn the pie into a donut
fig = plt.gcf()
fig.gca().add_artist(central_circle)
plt.rc('font', size=12)
plt.title("Home Ownership", fontsize=20)
plt.show()
Figure 6: Pie chart of major homeownership
MORTGAGE = LC_loan['loan_status'][LC_loan['home_ownership'] == 'MORTGAGE'].value_counts()
RENT = LC_loan['loan_status'][LC_loan['home_ownership'] == 'RENT'].value_counts()
OWN = LC_loan['loan_status'][LC_loan['home_ownership'] == 'OWN'].value_counts()
bars1 = [MORTGAGE["Fully Paid"],RENT["Fully Paid"],OWN["Fully Paid"]]
bars2 = [MORTGAGE["Charged Off"],RENT["Charged Off"],OWN["Charged Off"]]
r = [0,1,2]
names = home_ownership
barWidth = 0.7
fig, ax = plt.subplots(figsize=(5,5))
plt.bar(r, bars1, color='#A9DFBF', edgecolor='white', width=barWidth)
plt.bar(r, bars2, bottom=bars1, color="#D4EFDF", edgecolor='white', width=barWidth)
plt.xticks(r, names, fontweight='bold')
plt.xlabel("Home Ownership")
plt.legend(["Fully Paid","Charged Off"])
plt.show()
Figure 7: barplot of loan status by homeownership
Below is a map (figure 8) of the total loan amount issued in each state. California has the highest amount, followed by New York, Florida, and Texas; these top four states account for far more loan volume than the rest.
location = LC_loan.groupby("addr_state").sum(numeric_only=True)  # numeric_only skips string columns
location = location.reset_index()
location = location[["addr_state", "loan_amnt"]]
fig = go.Figure(data=go.Choropleth(locations = location['addr_state'],z = location['loan_amnt'].astype(float),
colorscale = 'Blues', locationmode = 'USA-states', colorbar_title = "USD"))
fig.update_layout(title_text = 'Loan Issued by State', title_font_size=25, geo_scope='usa')
fig.show()
Figure 8: loan amount distribution map
We plot four countplots of loan status against potential variables we consider significant for predicting it (figure 9). The top-left graph shows that the longer the employment length, the higher the proportion of borrowers who fully repay the loan. The top-right graph shows that the distribution of purposes looks similar across the two loan statuses. The bottom-left graph shows that borrowers with higher grades are more likely to fully repay. The bottom-right graph shows that shorter-term loans have a higher proportion of fully paid status, suggesting long-term loans carry more risk because the distant future is harder to predict.
fig,[[ax1, ax2],[ax3, ax4]] = plt.subplots(2,2, figsize=[20,15])
sns.countplot(x='loan_status', data = LC_loan, hue='emp_length', ax = ax1)
sns.countplot(x='loan_status', data = LC_loan, hue='purpose', ax = ax2)
sns.countplot(x='loan_status', data = LC_loan, hue='grade', palette = "Set2", ax = ax3)
sns.countplot(x='loan_status', data = LC_loan, hue='term', palette = "Set2", ax = ax4)
Figure 9: countplots of loan status by four potential categorical features
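The per-category proportions described above can be computed directly rather than read off the countplots. A sketch with `pd.crosstab` on a toy frame (column names follow the original data; the rows are made up):

```python
import pandas as pd

# Toy frame with the same column names as LC_loan (hypothetical rows)
toy = pd.DataFrame({
    "term": ["36 months"] * 3 + ["60 months"] * 2,
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off",
                    "Fully Paid", "Charged Off"],
})

# Row-normalized proportions of loan status within each term
prop = pd.crosstab(toy["term"], toy["loan_status"], normalize="index")
```

`normalize="index"` makes each row sum to 1, so the charged-off share of each term can be compared directly.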
To understand more about the borrowers, we extract and analyze the free-text descriptions attached to the loan applications. We collect all the short paragraphs left by borrowers and apply standard natural language processing steps: word tokenization, lowercasing, lemmatization, and stop-word removal. We then plot the term frequency distribution. According to figure 10, the main purposes of loan applications at LendingClub are monthly credit card payments and debt consolidation.
# read the description of loan application
LC_loan_desc = pd.read_csv(r'C:\Users\daiyh\OneDrive\Documents\loan_project\LC_loan.csv', usecols=['desc'])
LC_loan_desc = LC_loan_desc.dropna()
LC_loan_desc = LC_loan_desc.reset_index(drop = True)
# word tokenization (requires the punkt tokenizer models)
nltk.download("punkt")
desc_str = ' '.join(LC_loan_desc["desc"].values)  # join with a space so descriptions do not run together
desc_word = nltk.word_tokenize(desc_str)
# lowercasing
desc_word_lower = [word.lower() for word in desc_word]
# lemmatization (requires the wordnet corpus and the POS tagger)
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")
lemmatizer = nltk.WordNetLemmatizer()
def wordnet_pos(tag):
    """Map a Penn Treebank tag starting with "N", "V", "R", or "J" to a WordNet POS; default to noun."""
    table = {"N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV, "J": wordnet.ADJ}
    return table.get(tag[0], wordnet.NOUN)
word_and_tag = nltk.pos_tag(desc_word_lower)
desc_lemmatized_word = [lemmatizer.lemmatize(word,wordnet_pos(tag)) for (word, tag) in word_and_tag]
# removing stopwords
nltk.download("stopwords")
stopwords = nltk.corpus.stopwords.words("english")
desc_remove_stopword = [word for word in desc_lemmatized_word if word not in stopwords]
# plot the freq distribution
desc_20_fq = nltk.FreqDist(word for word in desc_remove_stopword if word.isalnum())
fig, ax = plt.subplots(figsize=(15,5))
desc_20_fq.plot(70, title='The Frequency Distribution for Tokens')
plt.show()
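The frequency count itself needs nothing beyond the standard library. A sketch with `collections.Counter` on a single made-up description (the lowercasing and stop-word steps mirror the NLTK pipeline above only in spirit; the stop-word set here is a toy assumption):

```python
from collections import Counter

desc = "Pay off my credit card debt and consolidate my debt"
stop = {"off", "my", "and"}  # toy stop-word set for illustration

# lowercase, split on whitespace, drop stop words, then count
tokens = [t for t in desc.lower().split() if t not in stop]
freq = Counter(tokens)
top = freq.most_common(2)
```

`nltk.FreqDist` offers the same `most_common` interface plus the plotting used above.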
Figure 10: frequency distribution for tokens
Using API access with keys, we extract two datasets from Nasdaq Data Link (https://data.nasdaq.com/data/FRED-federal-reserve-economic-data): the monthly loans and leases of commercial banks, and the monthly average loan rate, both from 2007 to 2018. We convert the dates to periods, build a data frame for each dataset, and plot them with matplotlib. Figures 11 and 12 show a rapid increase in loan amounts and a large drop in the interest rate lasting until 2016, when the central bank, which began raising rates in December 2015 to fight inflation, reversed the trend; after that, the growth in loan amounts slows and the interest rate ticks up slightly.
# get the key (kept in a local file so the key itself is not published)
def read_key(keyfile):
    with open(keyfile) as f:
        return f.readline().strip("\n")
key = read_key("api-keys/nasdaq.txt")
# access API of time series loan amount for commercial bank
response = requests.get("https://data.nasdaq.com/api/v3/datasets/120925/data", params = {
"api_key": key ,
"start_date":'2007-01-01',
"end_date":'2018-12-01'
})
response.raise_for_status()
data=response.json()['dataset_data']
# convert to dataframe
df1 = pd.DataFrame(data['data'],columns=['Date','Value'])
df1["Date"] = pd.to_datetime(df1["Date"])
df1=df1.loc[::-1]
# access API of interest rate for United States
response2 = requests.get("https://data.nasdaq.com/api/v3/datasets/25238617/data", params = {
"api_key": key ,
"start_date":'2007-01-01',
"end_date":'2018-12-01'
})
response2.raise_for_status()
data2=response2.json()['dataset_data']
# convert to dataframe
df2 = pd.DataFrame(data2['data'],columns=['Date','Average Loan Rate (percent)'])
df2=df2.iloc[::-1]
df2["Date"] = pd.to_datetime(df2["Date"])
df2.head()
# plot two dataframes
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
ax1.plot_date(df1['Date'],df1['Value'],'c')
ax1.axvline(x=16785,color = 'orange',linestyle='dashed')
ax1.set(title='Commercial Banks Loans',xlabel = 'Date', ylabel = 'Total Loans Amount')
ax2.plot_date(df2['Date'],df2['Average Loan Rate (percent)'],'c')
ax2.axvline(x=16785,color = 'orange',linestyle='dashed')
ax2.set(title = 'Loan rate of Commercial Banks', xlabel = 'Date', ylabel = 'Average Loan Rate (percent)')
plt.show()
Figure 11: commercial banks' monthly loan amounts. Figure 12: loan rate of commercial banks
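The "convert the dates to periods" step mentioned above is not shown in the code; a plausible sketch with pandas (monthly frequency is an assumption, since both series are monthly):

```python
import pandas as pd

# Toy frame shaped like df1 (hypothetical values)
df = pd.DataFrame({"Date": ["2007-01-01", "2007-02-01"], "Value": [1.0, 2.0]})
df["Date"] = pd.to_datetime(df["Date"])

# Collapse daily timestamps to monthly periods
df["Period"] = df["Date"].dt.to_period("M")
```

Periods are convenient for grouping and joining monthly series regardless of which day of the month the observation is stamped with.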
How did the financial crisis and monetary policy affect the loan amount and the interest rate? According to the New York Times, during the 2008 financial crisis the central bank purchased long-term bonds from the public at a large scale to increase demand in the bond market. As figure 13 shows, as demand increased, bond prices rose and the interest rate fell, since the price of a bond is negatively related to the interest rate. In addition, because the central bank purchased a large amount of loans, the loans moved onto the government's balance sheet and money flowed into the market. Thus, as figure 14 shows, the interest rate decreased as the money supply increased. With a lower interest rate, the cost of borrowing fell and commercial banks issued more loans. This is why, in figures 11 and 12, the interest rate keeps decreasing between 2008 and 2016 while the loan amount increases over the same period.
Figure 13: bond market Figure 14: money market Figure 15: whole market
Using web-scraping techniques, we collect the historical US mortgage rates for the last 11 years from ValuePenguin: https://www.valuepenguin.com/mortgages/historical-mortgage-rates. The mortgage rate is simply the interest rate on a mortgage loan. As above, we build a data frame and plot the data. Since mortgage loans are long-term loans, they were among the assets covered by the expansionary monetary policy known as quantitative easing: to lower interest rates during the 2008 financial crisis, the government bought long-term bonds, including mortgage loans. The time series plot shows the mortgage rate declining until 2016, when the government began tightening monetary policy, after which the rate went up. We are now in a situation similar to 2016, so the bank should prepare to respond to rising loan rates.
According to figure 15, as aggregate demand increased, the inflation rate (π) kept rising. Inflation can lead to serious unemployment and price-level problems, so the government used quantitative tightening to raise the interest rate, and the situation reversed after 2016, matching the change in figures 11 and 12. During the coronavirus financial crisis of the past two and a half years, the central bank used the same strategy to stimulate the economy, and inflation rose sharply. The central bank has announced tightening beginning in March 2022, which means the interest rate will rise in the near future. In that case, demand for loans will fall as the cost of borrowing increases, so the company should prepare for the change; for example, it could ease its borrowing terms to minimize the loss of price-sensitive customers.
# web scraping of mortgage rate
response3 = requests.get("https://www.valuepenguin.com/mortgages/historical-mortgage-rates")
response3.raise_for_status()  # note the call — without parentheses this line does nothing
html = lx.fromstring(response3.text)
html.make_links_absolute('https://www.valuepenguin.com/mortgages/historical-mortgage-rates')
tab = html.xpath("//table")[0]
# get header
h, b = tab.getchildren()
rows = h.xpath(".//tr")
header = rows[0]
header = [x.text for x in header.xpath(".//span")]
# get body
def extract_row(i):
    """Return the cell texts of the i-th row of the table body."""
    row = b.xpath(".//tr")[i]
    return [cell.text for cell in row]
body = [extract_row(i) for i in range(12)]
# convert to dataframe
df3 = pd.DataFrame(body, columns = header)
df3 = df3.iloc[::-1]
year = pd.to_datetime(df3["Year"], format= '%Y')
year = year.dt.strftime("%Y")
df3['Year'] = year
df3['Lowest Rate'] = df3['Lowest Rate'].str[:-1].astype(float)
df3['Highest Rate'] = df3['Highest Rate'].str[:-1].astype(float)
df3['Average Rate'] = df3['Average Rate'].str[:-1].astype(float)
# plot mortgage rate
fig, ax = plt.subplots(figsize=(10, 5))  # subplots returns (figure, axes)
ax.plot(df3['Year'], df3['Highest Rate'], 'c')
ax.plot(df3['Year'], df3['Average Rate'], 'y')
ax.plot(df3['Year'], df3['Lowest Rate'], 'g')
ax.axvline(x=8, color='orange', linestyle='dashed')
ax.set_title('Historical Mortgage Rates', fontweight="bold")
ax.set_xlabel('Year')
ax.set_ylabel('Mortgage Rate (percent)')  # the y-axis shows rates, not loan amounts
ax.legend(['Highest Rate', 'Average Rate', 'Lowest Rate'])
plt.show()
Figure 16: historical mortgage rates
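Stripping the trailing `%` with `str[:-1]`, as above, silently truncates values that happen to lack a percent sign. A slightly more defensive sketch (the sample values are made up):

```python
import pandas as pd

rates = pd.Series(["3.94%", "4.54%", "3.65"])  # last value lacks a percent sign

# rstrip removes the % only when present, so mixed formats still parse correctly
clean = rates.str.rstrip("%").astype(float)
```

With `str[:-1]`, the third value would lose its final digit before conversion.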
Of the 24 features, only four have missing values. We print the proportion of missing values for each variable. The feature employment length in years has the highest proportion of missing values at about 5.84%, which is not negligible; still, given the large size of the dataset, we decide to drop all rows with missing values.
# percentage of missing values of each column
print(LC_loan.isnull().sum() * 100 / LC_loan.shape[0])
LC_loan = LC_loan.dropna() # drop rows with na
loan_amnt              0.000000
funded_amnt            0.000000
funded_amnt_inv        0.000000
term                   0.000000
int_rate               0.000000
installment            0.000000
grade                  0.000000
emp_length             5.835904
home_ownership         0.000000
annual_inc             0.000000
verification_status    0.000000
loan_status            0.000000
purpose                0.000000
addr_state             0.000000
dti                    0.027800
earliest_cr_line       0.000000
inq_last_6mths         0.000074
open_acc               0.000000
pub_rec                0.000000
revol_bal              0.000000
revol_util             0.063703
total_acc              0.000000
total_rec_prncp        0.000000
total_rec_int          0.000000
dtype: float64
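As a sanity check on the drop, the row counts before and after `dropna` can be compared directly; a sketch on a toy frame with the same kind of scattered missingness (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: two of four rows contain a missing value somewhere
toy = pd.DataFrame({"emp_length": [10, np.nan, 3, 5],
                    "dti": [5.9, 16.1, np.nan, 17.0]})

before = len(toy)
toy_clean = toy.dropna()           # drop any row with at least one NaN
after = len(toy_clean)
lost_pct = 100 * (before - after) / before
```

Reporting `lost_pct` makes the cost of listwise deletion explicit before committing to it.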
We convert the categorical values into binary and multi-binary variables. For the numeric values, based on the correlation heatmap from the visualization section, we drop four features that are highly correlated with others: the monthly payment (installment), the principal received to date (total_rec_prncp), the interest received to date (total_rec_int), and the number of open credit lines (open_acc). We then split the dataset into a training group (80%) and a testing group (20%) for model training and model testing.
# convert the dependent variable into a binary indicator (1 = Charged Off)
dum_status_dum = []
for status in LC_loan['loan_status']:
    if status == 'Fully Paid':
        dum_status_dum.append(0)
    elif status == 'Charged Off':
        dum_status_dum.append(1)
LC_loan['loan_status_dum'] = dum_status_dum
LC_loan = LC_loan.drop(['loan_status'], axis=1)
# categorical feature: term
print(LC_loan['term'].unique())  # only two maturities
# extract the number of months
LC_loan["term"] = [int(duration[1:3]) for duration in LC_loan['term']]
# categorical feature: emp_length
print(LC_loan['emp_length'].unique())
emp_len = [year[0:2].strip() for year in LC_loan['emp_length']]
num_emp_len = []
for length in emp_len:
    if length == "<":
        num_emp_len.append('0')  # treat "< 1 year" as 0 years
    else:
        num_emp_len.append(length)
# extract the number of years
LC_loan['emp_length'] = [int(num) for num in num_emp_len]
# other categorical features
def dum_map(col_name):
    """Label-encode the categorical column as consecutive integers (not one-hot dummies)."""
    mapping = {num: index for index, num in enumerate(np.unique(LC_loan[col_name]))}
    LC_loan[col_name] = LC_loan[col_name].map(mapping)
dum_map("grade")
dum_map("home_ownership")
dum_map("verification_status")
dum_map("purpose")
dum_map("addr_state")
dum_map("earliest_cr_line")
[' 36 months' ' 60 months'] ['10+ years' '3 years' '4 years' '6 years' '7 years' '8 years' '2 years' '5 years' '9 years' '< 1 year' '1 year']
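The `dum_map` helper above label-encodes each category as a single integer; for true multi-binary (one-hot) indicators, as the encoding section describes, `pd.get_dummies` is the usual alternative. A sketch on a toy column (the prefix name is an illustrative choice):

```python
import pandas as pd

toy = pd.DataFrame({"home_ownership": ["MORTGAGE", "RENT", "OWN", "RENT"]})

# One indicator column per category instead of a single integer code
dummies = pd.get_dummies(toy["home_ownership"], prefix="home")
```

Label encoding imposes an arbitrary ordering on nominal categories, which tree models tolerate but linear models such as logistic regression generally should not be given; one-hot columns avoid that issue at the cost of width.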
# drop features with high correlation
LC_loan = LC_loan.drop(["installment"], axis=1)
LC_loan = LC_loan.drop(["total_rec_prncp"], axis=1)
LC_loan = LC_loan.drop(["total_rec_int"], axis=1)
LC_loan = LC_loan.drop(["open_acc"], axis=1)
LC_loan = LC_loan.reset_index(drop = True)
# dependent (target) variable: loan status
y = LC_loan["loan_status_dum"]
# independent variables (predictors)
X = LC_loan.loc[:, LC_loan.columns != 'loan_status_dum']
# splitting train and test groups
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
The dependent variable is heavily imbalanced rather than evenly distributed (figure 17), so we prefer a non-parametric model; we consider the random forest, which handles skewed data well, an appropriate choice. We train the random forest on the training set, use the fitted model to predict the loan status of the testing set, and compare the predictions with the true test labels. The accuracy is 81%, which indicates a relatively good fit.
# distribution of the target variable
plt.barh(loan_status['loan status'], loan_status['number of counts'], color=(0.2, 0.4, 0.6, 0.6))
plt.xlabel("Number of Counts")  # counts run along the x-axis for barh
plt.ylabel("Loan Status")
plt.title("Distribution of the Target Variable")
plt.show()
Figure 17: distribution of the target variable
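Given the roughly 4:1 skew toward fully paid loans visible in figure 17, one option, not used in the original analysis, is to reweight the classes during training. A hedged sketch with scikit-learn's `class_weight='balanced'` on synthetic data (all values and parameters here are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced binary problem (roughly 4:1)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 1.0).astype(int)

# 'balanced' reweights each class inversely to its frequency
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
clf.fit(X, y)
acc = clf.score(X, y)
```

Reweighting trades a little headline accuracy for better recall on the rare charged-off class, which is the class a lender most wants to catch.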
For logistic regression, the target variable is binary and we have already removed the highly correlated features, so the model's assumptions are met. We fit the logistic model on the training group and compare its predictions with the true labels in the testing group. The accuracy is also about 81%.
# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# random forest model
RF = RandomForestClassifier()
RF.fit(X_train, y_train)
RF_predict = RF.predict(X_test)
print('random forest model:\n {}'.format(classification_report(y_test,RF_predict)))
# logistic regression model
LR = LogisticRegression()
LR.fit(X_train, y_train)
LR_predict = LR.predict(X_test)
print('logistic regression model:\n {}'.format(classification_report(y_test,LR_predict)))
random forest model:
precision recall f1-score support
0 0.81 0.98 0.89 203916
1 0.51 0.07 0.13 49279
accuracy 0.81 253195
macro avg 0.66 0.53 0.51 253195
weighted avg 0.76 0.81 0.74 253195
logistic regression model:
precision recall f1-score support
0 0.81 0.98 0.89 203916
1 0.50 0.08 0.13 49279
accuracy 0.81 253195
macro avg 0.66 0.53 0.51 253195
weighted avg 0.75 0.81 0.74 253195
Since the random forest and logistic models achieve nearly the same accuracy, we use cross-validation to select one model to fit on the whole dataset. Under cross-validation, the random forest model reaches 80.61% accuracy and the logistic regression model 80.52%, so we choose the random forest model for the following analysis.
# Random Forest
RF_cross_validation_score = model_selection.cross_val_score(RF, X_train, y_train, cv=5)
# Logistic Regression
LR_cross_validation_score = model_selection.cross_val_score(LR, X_train, y_train, cv=5)
# result
print('Random Forest model accuracy: {:2.2%}'.format(RF_cross_validation_score.mean()))
print('Logistic Regression model accuracy: {:2.2%}'.format(LR_cross_validation_score.mean()))
Random Forest model accuracy: 80.61% Logistic Regression model accuracy: 80.52%
We fit a random forest model on the whole dataset and compute the importance score of every feature, then sort the scores in descending order and plot them as a barplot. According to figure 18, the debt-to-income ratio (dti), the revolving credit balance (revol_bal), and the interest rate (int_rate) are the three most significant features. Interestingly, the month the borrower opened their earliest credit line (earliest_cr_line) and the borrower's location (addr_state) also score relatively high for predicting the ability to repay the loan.
# fit the random forest model on the whole dataset
RF.fit(X, y)
# get the feature importance scores
importance_RF = RF.feature_importances_
X_colnames = list(X)
importance_dict = dict(zip(X_colnames, importance_RF))
index = np.argsort(importance_RF)[::-1]
key_order = [X_colnames[i] for i in index]
value_order = [importance_dict[k] for k in key_order]
for name, score in zip(key_order, value_order):
    print('{}: {:.5}'.format(name, score))
# plot significance scores
fig = plt.subplots(figsize=(10, 5))
plt.bar(key_order, value_order, align='center', alpha=0.5)
plt.xticks(rotation = 90)
plt.ylabel('importance score')
plt.title('Feature Importances')
plt.show()
dti: 0.096866
revol_bal: 0.092937
int_rate: 0.092742
revol_util: 0.089641
earliest_cr_line: 0.08628
annual_inc: 0.079183
total_acc: 0.072222
addr_state: 0.063241
funded_amnt_inv: 0.05089
loan_amnt: 0.047023
funded_amnt: 0.047007
emp_length: 0.041765
grade: 0.032951
purpose: 0.027443
inq_last_6mths: 0.022243
verification_status: 0.017156
term: 0.015211
pub_rec: 0.012669
home_ownership: 0.012531
Figure 18: barplot of feature importance scores
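The argsort bookkeeping above can be condensed: the same descending ordering falls out of a pandas Series. A sketch with made-up scores (only three features shown for brevity):

```python
import pandas as pd

# Hypothetical importance scores keyed by feature name
scores = {"dti": 0.097, "revol_bal": 0.093, "term": 0.015}

importances = pd.Series(scores).sort_values(ascending=False)
ranked = list(importances.index)
```

`importances.plot.bar()` would then reproduce the barplot without the manual dictionary and index juggling.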
In conclusion, if LendingClub wants a more precise prediction of borrowers' risk levels, the company should focus on the borrower's debt-to-income ratio and revolving credit balance. Also, since the interest rate is the third most important feature and our API analysis suggests rates will rise, the company should pay close attention to the monetary policy announced by the Federal Reserve.
Although the random forest model's accuracy is slightly higher than the logistic model's and we used the random forest to rank feature importance, the difference in accuracy between the two models is marginal. In future work, we could apply $L_1$ regularization for more aggressive feature selection, since $L_1$ can shrink the coefficient of an insignificant variable exactly to zero.
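The $L_1$ idea can be sketched with scikit-learn: `penalty='l1'` requires a compatible solver such as `liblinear`. The data and the `C=0.1` penalty strength below are illustrative assumptions, not tuned values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic problem: only the first two features carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# L1 regularization can shrink the coefficients of weak features toward exactly zero
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l1.fit(X, y)
```

Features whose coefficients are driven to zero are effectively deselected, which is the less conservative feature selection the conclusion proposes.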
Ceizyk, D. (2022, February 25). Historical mortgage rates: Averages and trends from the 1970s to 2019. ValuePenguin. Retrieved March 13, 2022, from https://www.valuepenguin.com/mortgages/historical-mortgage-rates
Economist's view. Economist's View: The Challenge of Sovereign Wealth Funds. (n.d.). Retrieved March 13, 2022, from https://economistsview.typepad.com/economistsview/2008/01/the-challenge-o.html
George, N. (2019, April 10). All lending club loan data. Kaggle. Retrieved March 13, 2022, from https://www.kaggle.com/wordsforthewise/lending-club
Nasdaq Data Link. data.nasdaq.com. (n.d.). Retrieved March 13, 2022, from https://data.nasdaq.com/data/FRED-federal-reserve-economic-data
Sommer, J. (2022, January 28). What may be in store as the Fed cuts back on the easy money. The New York Times. Retrieved March 13, 2022, from https://www.nytimes.com/2022/01/28/business/fed-inflation-stocks-bonds.html